
74 research outputs found

Variation in the Marking of Text Organization in French Research Articles: From Short and Specific to Extended and Vague

Author: Laippala Veronika
Publication venue: 'OpenEdition'
Publication date: 22/09/2017

Adopting a text-linguistic, corpus-based approach, this article studies variation in the marking of text organization. Why is text organization sometimes signaled very precisely, while at other times it is not signaled at all? The focus is on a particular mode of text organization, taking the form of text sequences, i.e. structures at least partially signaled by markers of addition or order, such as first, the last example. The material consists of 90 research articles in French with manually XML-annotated text sequences (XML = Extensible Markup Language). The results highlight the variation in the marking and show several factors affecting it. In shorter sequences, the marking is typically explicit and precise, while in longer ones explicit marking is more often omitted; when markers are used, they are only vague ones signaling simple addition. In addition, different markers tend to be used in the signaling of sequences of different lengths.

Directory of Open Access Journals

OpenEdition

Variation in the marking of text organization in French research articles: from short and specific to extended and vague.

Author: Laippala Veronika
Publication venue: 'OpenEdition'
Publication date: 28/10/2022

UTUPub

Les discussions Wikipedia : un corpus pour caractériser le genre « discussion »

Author: Ho-Dac Lydia-Mai
Laippala Veronika
Publication venue: HAL CCSD
Publication date: 23/10/2015

This presentation describes the intra-linguistic characteristics of Wikipedia talk pages, the discussion forum attached to each article of the Wikipedia encyclopedia. After outlining the properties that make these texts a particularly interesting object of study for corpus linguistics, we present the procedure used to build the discussion corpus and a first quantitative description of the resulting corpus. We conclude with a brief overview of a set of linguistic studies planned on this corpus.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

D’abord, ensuite, enfin et 0, De plus: Organisation textuelle par des séries linéaires dans les articles de recherche

Author: Laippala Veronika
Publication venue: University of Turku (Turun yliopisto)
Publication date: 05/12/2011

The study examines the signalling of text organisation in research articles (RAs) in French. The work concentrates on a particular type of organisation provided by text sequences, i.e. structures organising text into items of which at least some are signalled by markers of addition or order: First… 0… The third point… In addition… / Premièrement… 0… Le troisième point… De plus… By indicating the way the text is organised, these structures guide the reader through the reading process, so that the reader does not need to work out the text structure unaided. The aim of the work is to study factors affecting the marking of text sequences. Why is their structure sometimes signalled explicitly by markers such as secondly, whereas in other places such markers are not used? The corpus is manually XML-annotated and consists of 90 RAs (~800 000 words) in French from the fields of linguistics, education and history. The analysis highlights several factors affecting the marking of text sequences. First, exact markers (such as first) seem to be more frequent in sequences where all the items are explicitly signalled by a marker, whereas additive markers (such as moreover) are used in sequences with both explicitly signalled and unmarked items. The marking of explicitly signalled sequences thus seems to be precise and even repetitive, whereas the signalling of sequences with unmarked items is altogether vaguer. Second, the marking of text sequences seems to depend on the length of the text. The longer the text segment, the vaguer the marking. Additive markers and unmarked items are more frequent in longer sequences, possibly covering several pages, whereas shorter sequences are often signalled explicitly by exact markers. The marker types also vary according to the sequence length.
Anaphoric expressions, such as first, are fairly close to their referents and are used in short sequences; connectors, such as secondly, are frequently used in sequences of intermediate length; and the longest sequences are often signalled by constructions composed of an ordinal and a noun acting as the subject of the sentence: The first item is… Finally, the marking of text organisation also depends on the discipline the RA belongs to. In linguistics, the marking is fairly frequent and precise; exact markers such as second are the most used, and structures with unmarked items are less common. Similarly, the marking is fairly frequent in education. In this field, however, it is also less precise than in linguistics, with frequent unmarked items and additive markers. History, on the other hand, is characterised by less frequent marking. In addition, when used, the marking in this field is also less precise and less explicit.

UTUPub

Korpusaineistot

Author: Laippala Veronika
Palander-Collin Minna
Publication venue: Suomalaisen Kirjallisuuden Seura
Publication date: 01/01/2020

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers

Author: Ginter Filip
Haverinen Katri
Laippala Veronika
Salakoski Tapio
Publication venue
Publication date: 13/05/2009

Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 65-72. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

DSpace at Tartu University Library

Affectivity in the #jesuisCharlie Twitter discussion

Author: Marjut Johansson
Veronika Laippala
Publication venue: 'John Benjamins Publishing Company'
Publication date: 27/10/2022

UTUPub

Predictive keywords: Using machine learning to explain document characteristics

Author: Aki-Juhani Kyröläinen
Veronika Laippala
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2023

When exploring the characteristics of a discourse domain associated with texts, keyword analysis is widely used in corpus linguistics. However, one of the challenges facing this method is the evaluation of the quality of the keywords. Here, we propose casting keyword analysis as a prediction problem with the goal of discriminating the texts associated with the target corpus from those of the reference corpus. We demonstrate that, when using linear support vector machines, this approach can be used not only to quantify the discrimination between the two corpora, but also to extract keywords. To evaluate the keywords, we develop a systematic and rigorous approach anchored to the concepts of usefulness and relevance used in machine learning. The extracted keywords are compared with the recently proposed text dispersion keyness measure. We demonstrate that our approach extracts keywords that are highly useful and linguistically relevant, capturing the characteristics of their discourse domain.
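The idea in this abstract, treating keyword analysis as a classification task and reading keywords off a linear model's weights, can be illustrated with a minimal sketch. It is not the authors' implementation: the two toy corpora, the function names (`tokenize`, `train_linear_svm`, `predictive_keywords`) and the hyperparameters are all assumptions made for illustration, and a plain full-batch subgradient descent on the hinge loss stands in for an established linear SVM library.

```python
def tokenize(text):
    """Lowercase whitespace tokenisation; enough for a toy example."""
    return text.lower().split()


def train_linear_svm(docs, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM on binary bag-of-words features.

    Uses full-batch subgradient descent on the L2-regularised hinge loss.
    Hyperparameters are illustrative assumptions, not tuned values.
    """
    feats = [set(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*feats))
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        # Gradient of the regulariser; hinge-loss terms are added below.
        grad_w = {tok: reg * weights[tok] for tok in vocab}
        grad_b = 0.0
        for x, y in zip(feats, labels):
            score = sum(weights[tok] for tok in x) + bias
            if y * score < 1:  # document is inside the margin or misclassified
                for tok in x:
                    grad_w[tok] -= y / n
                grad_b -= y / n
        for tok in vocab:
            weights[tok] -= lr * grad_w[tok]
        bias -= lr * grad_b
    return weights, bias


def predictive_keywords(target_docs, reference_docs, top_k=3):
    """Return the top_k tokens whose weights point most strongly to the target corpus."""
    docs = target_docs + reference_docs
    labels = [1] * len(target_docs) + [-1] * len(reference_docs)
    weights, _ = train_linear_svm(docs, labels)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy corpora: a "clinical" target and a "sports" reference.
target = [
    "the patient received treatment",
    "patient symptoms and treatment plan",
    "the doctor examined the patient",
]
reference = [
    "the match ended in a draw",
    "the team won the match",
    "fans cheered the team",
]

keywords = predictive_keywords(target, reference)
for token, weight in keywords:
    print(f"{token}\t{weight:.3f}")
```

In this toy setting the highest-weighted token is the one that occurs in every target text and in no reference text. A realistic study would presumably rely on a tested SVM implementation and measure discrimination on held-out texts rather than on the training data.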

Directory of Open Access Journals

French Wikipedia Talk Pages: Profiling and Conflict Detection

Author: Ho-Dac Lydia-Mai
Laippala Veronika
Poudat Céline
Tanguy Ludovic
Publication venue: HAL CCSD
Publication date: 27/09/2016

Wikipedia is a popular and extremely useful resource for studies in both linguistics and natural language processing (Yano and Kang, 2008; Ferschke et al., 2013). This paper introduces a new language resource based on the French Wikipedia online discussion pages, the WikiTalk corpus. The publicly available corpus includes 160M words and 3M posts structured into 1M thematic sections and has been syntactically parsed with the Talismane toolkit (Urieli, 2013). In this paper, we present the first results of experiments aiming at classifying and profiling the talk pages and threads in order to determine criteria for selecting discussions with conflicts.

Scientific Publications of the University of Toulouse II Le Mirail

HAL-UNICE

HAL Descartes

Toxicity Detection in Finnish Using Machine Translation

Author: Eskelinen Anni
Ginter Filip
Laippala Veronika
Pyysalo Sampo
Silvala Laura
Publication venue: University of Tartu Library
Publication date: 01/05/2023

DSpace at Tartu University Library